Computer and Modernization ›› 2012, Vol. 198 ›› Issue (2): 128-130. doi: 10.3969/j.issn.1006-2475.2012.02.034

• Network and Communication •

一种基于单模型的网页净化方法

干文敏1,李俊1,李剑2   

  1. 南京航空航天大学计算机科学与技术学院,江苏 南京 210016; 2. 南昌陆军学院战斗实验室,江西 南昌 330103
  • 收稿日期:2011-10-21 修回日期:1900-01-01 出版日期:2012-02-24 发布日期:2012-02-24

A Method of Web Page Purification Based on Single Model

GAN Wen-min1, LI Jun1, LI Jian2   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China; 2. Battle Laboratory, Nanchang Army College, Nanchang 330103, China
  • Received:2011-10-21 Revised:1900-01-01 Online:2012-02-24 Published:2012-02-24

摘要: 为了能够更好地获得和处理网页中的正文信息,本文提出基于改进的DOM树和BP神经网络的网页净化算法。该算法根据DOM树和网页内容的特征用HTMLParser把网页转换成一棵内容块树。因网页子内容块具有相当明显的数值特征,可以通过BP神经网络建立网页噪音信息过滤模型。这样使得网页净化更加模型化,也能够取得更加好的效果。

关键词: 网页净化, DOM树, 内容块, 神经网络

Abstract: To better extract and process the main-text content of Web pages, this paper proposes a Web page purification algorithm based on an improved DOM tree and a BP neural network. Using HTMLParser, the algorithm converts a Web page into a content block tree according to the features of the DOM tree and of the page content. Because the sub-blocks of a Web page have distinct numerical features, a BP neural network can be used to build a model that filters noise information from the page. This makes Web page purification more model-driven and yields better results.

Key words: Web page purification, DOM tree, content block, neural network
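
The pipeline outlined in the abstract (split the page into content blocks, describe each block numerically, classify it with a BP network) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the sketch uses Python's standard-library `html.parser.HTMLParser`, which may not be the HTMLParser tool the paper refers to; it produces a flat list of innermost blocks instead of a full content block tree; and the `BLOCK_TAGS` set, the three features, the scaling constants, and the hand-labelled training samples are all hypothetical.

```python
import math
import random
from html.parser import HTMLParser

BLOCK_TAGS = {"body", "div", "td", "p", "ul", "table"}  # assumed block-level tags


class BlockExtractor(HTMLParser):
    """Collects text and link-text length for each innermost block element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open blocks, innermost last
        self.blocks = []   # closed blocks that contained text
        self.in_link = 0   # depth of open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text": "", "link_chars": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            if block["text"]:
                self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if self.stack and text:
            self.stack[-1]["text"] += text
            if self.in_link:
                self.stack[-1]["link_chars"] += len(text)


def features(block):
    """Hypothetical numeric features: scaled length, link ratio, scaled punctuation."""
    n = len(block["text"])
    punct = sum(block["text"].count(c) for c in ",.;")
    return [min(n / 200.0, 1.0), block["link_chars"] / n, min(punct / 10.0, 1.0)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    """One hidden layer, sigmoid units, trained by plain backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, h + [1.0])))
        return h, y

    def train(self, samples, epochs=2000, lr=0.5):
        for _ in range(epochs):
            for x, target in samples:
                h, y = self.forward(x)
                delta_out = (target - y) * y * (1 - y)
                # Hidden deltas use the output weights before they are updated.
                delta_h = [delta_out * self.w2[j] * h[j] * (1 - h[j])
                           for j in range(len(h))]
                for j, v in enumerate(h + [1.0]):
                    self.w2[j] += lr * delta_out * v
                for j, row in enumerate(self.w1):
                    for i, v in enumerate(x + [1.0]):
                        row[i] += lr * delta_h[j] * v


# Hand-labelled toy samples (hypothetical): 1.0 = main content, 0.0 = noise.
samples = [([0.05, 1.0, 0.0], 0.0), ([0.10, 0.8, 0.0], 0.0),
           ([0.80, 0.0, 0.5], 1.0), ([0.60, 0.1, 0.3], 1.0)]
net = BPNetwork(n_in=3, n_hidden=4)
net.train(samples)

page = """<body>
<div><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<p>Web page purification removes navigation bars, advertisements, and other
noise, so that only the main text remains for later processing.</p>
</body>"""
extractor = BlockExtractor()
extractor.feed(page)
scores = [net.forward(features(b))[1] for b in extractor.blocks]
kept = [b["text"] for b, s in zip(extractor.blocks, scores) if s > 0.5]
```

In this toy setup `kept` holds the text of blocks the network scores above 0.5. A real system would need labelled blocks from actual pages and a richer feature set than the three assumed here.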

CLC Number: